DOCUMENT RESUME

ED 093 985                                          TM 003 841

AUTHOR          Follman, John; And Others
TITLE           Kinds of Keys of Student Ratings of Faculty Teaching
                Effectiveness.
PUB DATE        [Apr 74]
NOTE            11p.; Paper presented at the Annual Meeting of the
                American Educational Research Association (59th,
                Chicago, Illinois, April 1974)

EDRS PRICE      MF-$0.75 HC-$1.50 PLUS POSTAGE
DESCRIPTORS     College Students; *College Teachers; *Effective
                Teaching; Measurement Techniques; *Rating Scales;
                Reliability; *Response Mode; *Teacher Rating; Testing
                Problems



ABSTRACT 

Three substudies of effects of different formats on 
student ratings of faculty teaching effectiveness were conducted. One 
substudy investigated Kinds of Keys, Agreement, Evaluation, and Needs 
Improvement. The second, NO TUP (New Observation of Teaching of
University Professor Rating Scale), investigated numbers of positive
rating categories. The third, Wording, investigated the same items
worded positively, negatively, and neutrally, respectively.
Practically important differences in level of ratings obtained in
Kinds of Keys, and practically and statistically significant
differences obtained in NO TUP and Wording. Additional research is
necessary to determine if apparent differences in teaching 
effectiveness are actually differences in teaching effectiveness or 
differences in the methods of measurement. (Author)

There currently is considerable interest among college faculty,
administrators, and students in student evaluation of the effectiveness
of college courses and college professors. The most commonly used means
of obtaining student evaluation of instruction is student ratings. While
there has been considerable interest in student evaluation there has not
been a corresponding amount of research, particularly on the technical
aspects of student rating scales. Some of the technical aspects on
which little research has been reported are the keys and the formats.

There has been little applied research reported on the effects of
kinds of keys and forms of formats used in student rating scales of
faculty teaching and course effectiveness. However, there is relevant
basic research on kinds of keys and formats in rating scales in job
performance ratings. Barrett, Taylor, Parker, and Martens (1958)
investigated four formats: trait names only; verbal definitions of traits;
trait names and behavioral descriptions but no definitions; and trait
definitions and behavioral descriptions but no trait names. There were
significant differences for formats and for all interactions involving
formats. Higher ratings were associated with trait names and behavioral
descriptions but no definitions. In a similar study Madden and Bourdon
(1964) examined various format forms including horizontal, vertical, bars,
no bars, number, and labels arrangements. Again there were significant
differences for formats and also all interactions. It is apparent from
these basic rating research studies that the methods of measurement, as
well as the variables measured, influence the level of ratings awarded.

The purpose of this paper is to report the results of three
substudies of kinds of keys for college student ratings of college
professors' teaching effectiveness. In the first substudy, Kinds of Keys,
the three main kinds of keys were investigated: Agreement; Evaluation; and
Needs Improvement. The reason for this specific substudy was to determine
if the different rating contexts per se influenced the level of ratings
awarded.

In the second substudy, NO TUP, four sets of evaluative keys, ranging
from two negative, one neutral, two positive, to all five positive
categories, were investigated. NO TUP is an acronym for the New
Observation of Teaching of University Professors rating scale. A response
set that characterizes many rating situations is the leniency (generosity)
effect. The leniency effect is the tendency of raters to consistently
assign ratings that are too high. In order to ameliorate this
problem Guilford (1954) recommended an unbalanced set of categories with
three positive, one neutral, and one negative rather than a conventional
set of categories with two positive, one neutral, and two negative. In
another milieu involving ratings, essay grading of English compositions,
Follman and colleagues conducted three relevant studies (Follman and
Reilly, in press; Follman, 1972; Follman, Silverman, and Reilly, 1972).
In these three analyses the following kinds of categories were investigated:
Numbers (5, 4, 3, 2, 1); Negative categories (one positive, one neutral,
three negative); Conventional (two positive, one neutral, two negative);
and Guilford categories (three positive, one neutral, one negative).
Across all three studies all sets of categories were reliable. Across all
three studies it was concluded that kinds of categories influence level of
ratings, that negative categories produce the highest ratings, and that
positive (Guilford) categories produce the lowest ratings. Thus the
reason for the second specific substudy was to determine if different
combinations of positive categories would reduce the leniency error in
student ratings of instructor teaching effectiveness as they did in
English composition scoring.

In the third substudy, Item Wording Direction, the agreement keys
used in the Kinds of Keys substudy were used for the same set of items,
each set respectively worded positively, negatively, or neutrally.
Two basic rating research studies were identified in which item phrasing
was varied positively and negatively. Whipple (1957) compared positive
and negative phrasing in an item writing study. Little difference was
found between the two forms of phrasing but there was a tendency for "true"
to be given to positively worded items. Ishikawa (19??) made a number of
empirical comparisons including one between affirmative statement and
question statement formats and found few differences. Thus the reason
for the third specific substudy was to determine in the context of
student rating scales if the item wording tone of different rating sets,
positive, negative, or neutral, would influence the level of ratings
awarded. The three studies are depicted in Table 1.

The objective across all three substudies was to determine if the
keys per se affected the level of student ratings of faculty teaching
effectiveness. Reliability was also considered, but it was not anticipated
to be critical because of the substantial size of each treatment group
within each sample within each substudy.

All three substudies were conducted in December 197? at the University
of South Florida. The Ss for Substudy #1 (Kinds of Keys) were students in
an undergraduate finance course. The Ss for Substudy #2 (NO TUP) were
undergraduates in another section of the finance course conducted by
another instructor. The Ss for Substudy #3 (Item Wording Direction)
were students in an undergraduate broadcasting course. The three
instructors were chosen because they were considered to be characteristic
college teachers. Operationally, this means that they usually receive
student ratings near four on a five point scale.

The rating scale for Substudy #2 consisted of 17 conventional college
teaching effectiveness rating items developed at the University of South
Florida. These items were also used for Substudy #1 and Substudy #3.
In addition, 23 items, developed by the University of South Florida
College of Education, were also used.

The students in the class composing each respective substudy were
randomly assigned to its respective treatment conditions.
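The random assignment step can be sketched as follows. This is a minimal illustration and not the procedure the authors report using: the function name, the seed parameter, and the round-robin dealing that keeps group sizes nearly equal are all assumptions.

```python
import random

def assign_to_conditions(students, conditions, seed=None):
    """Shuffle the class, then deal students to conditions in turn.

    Hypothetical helper: the paper states only that assignment was random;
    the near-equal group sizes produced here are an assumption.
    """
    rng = random.Random(seed)
    shuffled = list(students)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for index, student in enumerate(shuffled):
        groups[conditions[index % len(conditions)]].append(student)
    return groups

# Example with made-up student identifiers 0..9 and the Substudy #1 keys.
groups = assign_to_conditions(
    range(10), ["Agreement", "Evaluation", "Needs Improvement"], seed=1
)
```

Each student appears in exactly one group, and group sizes differ by at most one.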

The ratings were quantified 5, 4, 3, 2, or 1 for the statistical
analyses as indicated in Table 1.
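The quantification step can be illustrated with a short sketch. The category labels below are hypothetical, since Table 1 is not reproduced in this text, and the per-item average is shown on the assumption that dividing a total score by the number of items is how a total is placed back on the five point scale.

```python
# Hypothetical labels, ordered from most to least favorable.
CONVENTIONAL_KEY = ["Excellent", "Good", "Average", "Fair", "Poor"]

def score_map(ordered_labels):
    """Map five ordered categories to the scores 5, 4, 3, 2, 1."""
    if len(ordered_labels) != 5:
        raise ValueError("expected exactly five categories")
    return {label: 5 - position for position, label in enumerate(ordered_labels)}

def total_and_adjusted(responses, ordered_labels):
    """Return (total score, mean per item) for one rater's responses."""
    scores = score_map(ordered_labels)
    total = sum(scores[response] for response in responses)
    return total, total / len(responses)

# One made-up rater answering four items.
total, adjusted = total_and_adjusted(
    ["Excellent", "Good", "Good", "Average"], CONVENTIONAL_KEY
)
```

Because every key is scored by position rather than by label, a key with five positive categories still yields scores from 5 to 1, which is what lets the different keys be compared on one scale.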

ANOVA adjusted group reliability estimates were determined for each
treatment group for each substudy.

Means, standard deviations, and ANOVA's were computed for total of
items, and individual items, for each format to determine the effects of
each treatment format within each substudy on level of ratings.
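For readers unfamiliar with the analysis, a one-way ANOVA on total scores can be sketched in a few lines. This is a generic textbook computation under the assumption of one score per student and one group per treatment format; the function name and the sample data are illustrative, not taken from the study.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score groups."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Made-up scores for three treatment formats.
f_ratio, df_between, df_within = one_way_anova(
    [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
)
```

The same function applies to a single item by passing each group's ratings on that item instead of total scores.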




Table 2 indicates group reliability estimates, means adjusted
to the five point scale, and total score means and standard deviations for
each treatment group within each substudy, across all three substudies.

Initially, it is apparent from Table 2 that all treatment groups across
all three substudies rated reliably. Even the Item Wording Direction Neutral
group's estimate, .79, the lowest, is adequate. It is likely that this
group's estimate would have been higher had its size been bigger.
Consequently considerable confidence can be placed in the integrity of each
group's ratings as a dependent variable in the three subsequent treatment
effects analyses.

The treatment effect analyses will be treated sequentially. For
Substudy #1 Table 2 indicates a lower mean for the Evaluation format
vis a vis the Agreement and Needs Improvement formats. An ANOVA indicated
a non-significant F of 2.?? for 2 and 10? degrees of freedom. One-way
ANOVA's for each individual item indicated significant differences for keys
for only one item. Despite the general non-significant nature of these
differences it is suggested that an absolute ratings difference of .23 or .25
(3.71 vis a vis 3.94 and 3.96) on a five point scale would not be perceived
indifferently by faculty of an institution where such ratings were used
administratively. This is particularly true since the Evaluation format
is used much more frequently than the Needs Improvement format. Therefore
it is suggested that additional research be carried out on this issue using
a large number and variety of instructors and students.

Examination of the results for Substudy #2, NO TUP, indicates some
fascinating findings. Specifically, the adjusted means were 4.13, 3.6?,
3.68, and 3.51 for Conventional, Garden Variety, Nirvana,
and NO TUP, respectively. The ANOVA indicated a significant (p < .05) F
of 3.15 for these keys. ANOVA's for individual items employing a
conservative level of significance indicated three (of 17) items significant
(p < .05). Since the means range significantly from 3.51 to 4.13 depending
upon the kinds of categories used, and since this range could be extended
even more by using negative categories, it is clear that the kinds of
categories used influence the level of ratings awarded. The corollary
conundrum is the issue of which categories to use. If the assumption is made
that college professors in general are better teachers than teachers in
general (the focal argument would be that they know more) then some set of
categories employing more positive than negative categories would probably be
appropriate. If, on the other hand, the assumption is made that college
professors are not better than teachers in general or else that student
raters should compare the particular professor with other professors
only and not with teachers in general, then a balanced set of categories
might be appropriate. In any case it is clear that kinds of categories to
be used in rating instructors is an issue that should be considered seriously.

Examination of the results for Substudy #3, Item Wording Direction,
indicates means of 3.36, 3.03, and 2.15 for Positive, Neutral,
and Negative categories, respectively. The ANOVA indicated a highly
significant (p < .001) F of 19.6 for 2 and 58 degrees of freedom for these
rating set tones. ANOVA's for individual items again using a conservative
significance procedure indicated 21 (of 40) items significant (p < .05).
These findings are viewed as additional evidence of the effects of format
factors in addition to the actual competence of the instructor being
considered. It is not considered that these findings are otherwise important,
for two reasons. One reason is that in order to make this substudy similar
structurally to the other two substudies some conceptual interpretative
uncertainties were built into the combination of the individual items and
the agreement format. Secondly, hopefully no one will employ a negative
format.
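The significance level reported for Substudy #3 can be checked by hand, because the upper tail of an F distribution with exactly 2 numerator degrees of freedom has a closed form: P(F > f) = (d / (d + 2f))^(d/2), where d is the denominator degrees of freedom. The sketch below is illustrative only. The function name is hypothetical, and it assumes the degrees of freedom are 2 and 58, which is a reading of a partly illegible figure in the scanned text.

```python
def f_upper_tail_df1_2(f_value, df_within):
    """P(F > f_value) for an F distribution with 2 and df_within df.

    The closed form is valid only when the numerator df is exactly 2,
    so it does not apply to the four-group NO TUP analysis.
    """
    return (df_within / (df_within + 2.0 * f_value)) ** (df_within / 2.0)

# Assumed reading of the report: F = 19.6 with 2 and 58 degrees of freedom.
p_value = f_upper_tail_df1_2(19.6, 58)
print(p_value < 0.001)
```

Under that reading the tail probability falls well below .001, which agrees with the level the text reports.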

OVERVIEW

Overview of these three substudies indicates the following conclusions.

Initially, it is compellingly clear that kinds of categories influence
massively level of ratings awarded.

Secondly, it is also compellingly clear that this source of spurious
variance will have to be taken into account in any administrative
application of student ratings of faculty teaching effectiveness. The
paramountcy of this issue is evident when it is considered that most faculty
fall within 1.5 ratings on a conventional five category scale and that NO TUP
alone manipulated .62 of a rating unit, almost half of the actual functional
range from which to differentiate faculty, assuming such an administrative
objective.
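The arithmetic behind this comparison can be laid out explicitly. The sketch below uses only figures stated in the text (the Conventional and NO TUP adjusted means and the 1.5 rating range); the variable names are illustrative.

```python
conventional_mean = 4.13   # adjusted mean under the Conventional key
no_tup_mean = 3.51         # adjusted mean under the NO TUP key
functional_range = 1.5     # range within which most faculty ratings fall

shift = conventional_mean - no_tup_mean
share_of_range = shift / functional_range
print(f"shift = {shift:.2f} rating units, {share_of_range:.0%} of the range")
```

The shift of .62 is about 41 percent of the 1.5 unit range, which is the basis for the phrase "almost half".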

Third, additional research is recommended on the kinds of formats
employed, i.e., Evaluation, Agreement, Needs Improvement. It appears that
while there may be some limited differences in level of ratings awarded it
might be prudent to please faculty by using either the Agreement or Needs
Improvement formats, particularly if the rating levels are similar.

Fourth, both empirical evidence and rational research need to be
reported on the question of which kind of categories should be used. This
enigma has philosophical implications on the number system to be used in
quantifying the data for the statistical analyses.

Fifth, while rating reliability is not a matter of concern, validity
is a vital concern. Implicit in the use of student ratings is the assumption
that students are an appropriate, valid public. This is probably more of
a normative question than an empirical one considering the characteristic
correlations of .30 - .40 between student ratings of teacher effectiveness
and the students' achievement (Follman, 1972).

Finally, the pro forma caveat is noted that the results reported
herein are veridical to the extent that the three instructors used represent
college instructors in general. It is considered that the total item
ratings of 3.71, 4.13, and 3.36 for Instructors #1, #2, and #3,
respectively, on the most conventional keys are certainly at worst ball
park figures. The high reliability estimates provide additional support
for this interpretation. In any case it is recommended that research be
conducted on the questions raised herein as they are integral to any
administrative application of student ratings.